Honor Code: “The codes and results derived by using these codes constitute my own work. I have consulted the following resources regarding this assignment:”

Web sources:
http://rexdouglass.com/mosaic-plots-with-percentage-labels/
StackOverflow
Piazza
R-Bloggers

Demographics of the Participants

As the density plots show, many people in the older age groups do not have a college education, while younger people are more likely to have one. A similar pattern appears when the group is further divided by race and by sex.

Analysis of Smoking Effects on Cancer Incidence and Death

Many previous studies show that smoking can cause cancer, so I want to examine whether this data set supports that statement. I first analyze which age group smokes the most. As the first calculation shows, the mean age of each smoking category lies between about 44 and 53, and current smokers have the lowest mean (about 44.6). In addition, the boxplot shows that the median age of current smokers is much lower than that of past smokers, nonsmokers, and those with unknown status. The box for current smokers also sits toward the lower end of the age scale, so most of its observations are concentrated at younger ages. Both the calculation and the graph therefore indicate that current smokers tend to be younger than the other groups.

The average age of different kinds of smokers:

##   Current      Past Nonsmoker   Unknown 
##  44.63609  51.94214  51.20959  52.63226
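The group means above come from a single tapply call. As a minimal sketch of the same idea on a small made-up data frame (the `toy` values are invented for illustration and are not the NHANES data), the mean and the median age per smoking category can be computed side by side, which is the comparison the boxplot shows graphically:

```r
# Toy data, invented for illustration only (not the NHANES values)
toy <- data.frame(
  Age   = c(30, 40, 50, 60, 70, 80),
  Smoke = factor(c("Current", "Current", "Past", "Past", "Nonsmoker", "Nonsmoker"),
                 levels = c("Current", "Past", "Nonsmoker"))
)

# Mean and median age within each smoking category; na.rm guards against missing ages
group_means   <- tapply(toy$Age, toy$Smoke, mean, na.rm = TRUE)
group_medians <- tapply(toy$Age, toy$Smoke, median, na.rm = TRUE)

print(group_means)
print(group_medians)
```

Comparing the mean with the median in each group is also a quick check for skewness: a mean well above the median suggests a longer tail toward older ages.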

Secondly, I want to check whether there is a gender difference in smoking. The first mosaic plot shows that the numbers of female current smokers and of females with unknown smoking status are close to those of males. There are more past smokers among males and more nonsmokers among females.

Thirdly, let's look at the effect of smoking on cancer incidence and death. Most people without a cancer incidence are current smokers or nonsmokers, and the same holds for those with a cancer incidence. The plot of smoking versus cancer death shows the same trend. Therefore, based on these plots alone, I do not see a clear effect of smoking on cancer incidence or death in this data; this is a descriptive comparison rather than a formal significance test.
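One possible way to back up a statement like this with numbers is to compare the share of cancer cases within each smoking group and to run a chi-squared test of independence. The sketch below uses hypothetical counts that I made up for illustration (they are not the NHANES table), and the object names `tab`, `row_share`, and `test` are my own:

```r
# Hypothetical counts, invented for illustration only (not the NHANES table)
tab <- as.table(matrix(
  c(90, 80, 95,   # column "No":  Current, Past, Nonsmoker
    10, 20,  5),  # column "Yes": Current, Past, Nonsmoker
  nrow = 3,
  dimnames = list(Smoke  = c("Current", "Past", "Nonsmoker"),
                  Cancer = c("No", "Yes"))
))

# Share of each smoking group with and without cancer (each row sums to 1)
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 2))

# Chi-squared test of independence between smoking group and cancer status
test <- chisq.test(tab)
print(test$p.value)
```

Row proportions answer "what fraction of each smoking group has cancer", which is not the same question as "which smoking group is largest among the cancer cases"; the latter mostly reflects how many people are in each smoking group.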

Lastly, I want to analyze how the patterns change across gender and race. When I add sex or race as a conditioning factor in the relationship between smoking type and cancer incidence or death, current smokers and nonsmokers still dominate in every panel.

Relationship between Continuous Variables

Firstly, I use a scatterplot matrix to examine the relationships among all the continuous variables. BMI shows a clear positive correlation only with weight. Furthermore, the same pattern appears across education levels, genders, and races in the following three ggplot2 plots.
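A scatterplot matrix can be complemented by a numeric correlation matrix. The following is a minimal sketch on made-up values (the `toy` data frame is invented and only borrows three variable names from the data set); `use = "pairwise.complete.obs"` is one way to handle the missing values that the plotting warnings point to:

```r
# Toy values, invented for illustration only (not the NHANES measurements)
toy <- data.frame(
  BMI        = c(20, 22, 25, 27, NA, 31),
  Weight     = c(55, 62, 70, 78, 85, 95),
  Hemoglobin = c(14, 13, 15, 12, 14, 13)
)

# Pearson correlations; each pair uses only the rows where both values are present
cors <- cor(toy, use = "pairwise.complete.obs")
print(round(cors, 2))
```

With pairwise deletion each entry may be based on a different number of rows, so the matrix is best read as a screening tool rather than as input to further modeling.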

## Warning: Removed 5868 rows containing non-finite values (stat_smooth).

Secondly, I notice that there is also a correlation between Transferin and TIBC. I use a Box-Cox transformation to find the best lambda for transforming the response. I plot the residuals and a normal Q-Q plot to check whether the residuals of the transformed model are approximately normal. Then, I use xyplots to examine the relationship across different factors. For each factor I plot two graphs, one with the untransformed and one with the transformed data, to show the effect of the transformation. I find an obvious correlation between these two variables for the different types of smokers, cancer incidence, and cancer death. The relationship is close to linear with a negative slope.
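The Box-Cox step described above can be sketched on simulated data. Everything in this example is an assumption for illustration: the `toy` data frame is simulated so that the log of the response is linear in the predictor, which means the estimated lambda should come out near 0, and the names `lambda_hat` and `y_transformed` are my own.

```r
library(MASS)  # provides boxcox()

# Simulated data, for illustration only: log(y) is linear in x
set.seed(1)
toy <- data.frame(x = runif(200, min = 1, max = 10))
toy$y <- exp(0.3 * toy$x + rnorm(200, sd = 0.2))

# Profile log-likelihood over a grid of lambda values (no plot drawn here)
fit0 <- lm(y ~ x, data = toy, y = TRUE, qr = TRUE)
bx <- boxcox(fit0, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda_hat <- bx$x[which.max(bx$y)]

# Apply the transformation; lambda = 0 corresponds to the log transform
if (abs(lambda_hat) < 1e-8) {
  y_transformed <- log(toy$y)
} else {
  y_transformed <- (toy$y^lambda_hat - 1) / lambda_hat
}

# Refit on the transformed response and inspect the residuals
fit1 <- lm(y_transformed ~ x, data = toy)
res <- resid(fit1)
print(lambda_hat)
```

The full Box-Cox form (y^lambda - 1) / lambda keeps the ordering of y for any lambda, whereas the plain power y^lambda reverses the ordering when lambda is negative, so the sign of a fitted slope can flip after a plain power transformation.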
## Warning: Removed 1019 rows containing non-finite values (stat_smooth).

Lastly, I repeat the analysis above for Serum Iron and Transferin. I find a positive correlation between these two variables for the different types of smokers, cancer incidence, and cancer death. The relationship is not quite linear.

The R-squared of the linear model of transformed Serum Iron on Transferin:

## [1] 0.8158457

Code Appendix:
load("/Users/Chloechen/Downloads/NHANES.Rdata")
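# Recode the 0/1 indicators as labeled factors before attaching, so the attached copy matches NHANES
NHANES$Ed = factor(NHANES$Ed, levels = c(0, 1), labels = c("no college edu", "college edu"))
NHANES$Race = factor(NHANES$Race, levels = c(0, 1), labels = c("Non-Caucasian", "Caucasian"))
attach(NHANES)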
library(lattice)
densityplot(~ Age | Ed, data = NHANES, main = "Age Group by College Education")
densityplot(~ Age | Ed + Race, data = NHANES, main = "Age Group by College Education and Race")
densityplot(~ Age | Ed + Sex, data = NHANES, main = "Age Group by College Education and Sex")
tapply(NHANES$Age, NHANES$Smoke, mean)
# Smoke is a factor, so plot() draws one boxplot of Age per smoking category
plot(Smoke, Age, main = "Boxplot of Age by Smoking Status", xlab = "Smoke", ylab = "Age")
tab = prop.table(table(Smoke, Sex))
plot(tab, main = "Smokers by Sex")
## Mosaic plots with percentage labels, adapted from the rexdouglass.com post listed under Web sources
library(vcd)
table = table(Smoke, Cancer.Incidence)
table = table[,]
proportion = round(prop.table(table, 2) * 100)
values = c(table)
rowname = c("Current", "Past", "Nonsmoker", "Unknown")
colname = c("No", "Yes")
names = c("Type of Smokers", "Cancer Incidence")
dims = c(4, 2)

tabs = structure(c(values), .Dim = as.integer(dims), .Dimnames = structure(list(rowname, colname), .Names = c(names)), class = "table")

proportion = structure(c(proportion), .Dim = as.integer(dims), .Dimnames = structure(list(rowname, colname), .Names = c(names)), class = "table")

transproportion = structure( c(paste(proportion,"%","\n", "(",values,")",sep="")), .Dim = as.integer(dims), .Dimnames = structure(list(rowname, colname), .Names = c(names)), class = "table")

mosaic(tabs,pop=FALSE, main="Type of Smokers by Cancer Incidence")

labeling_cells(text=transproportion , clip_cells=FALSE)(tabs)


table = table(Smoke, Cancer.Death)
table = table[,]
proportion = round(prop.table(table, 2) * 100)
values = c(table)
rowname = c("Current", "Past", "Nonsmoker", "Unknown")
colname = c("No", "Yes")
names = c("Type of Smokers", "Cancer Death")
dims = c(4, 2)

tabs = structure(c(values), .Dim = as.integer(dims), .Dimnames = structure(list(rowname, colname), .Names = c(names)), class = "table")

proportion = structure(c(proportion), .Dim = as.integer(dims), .Dimnames = structure(list(rowname, colname), .Names = c(names)), class = "table")

transproportion = structure( c(paste(proportion,"%","\n", "(",values,")",sep="")), .Dim = as.integer(dims), .Dimnames = structure(list(rowname, colname), .Names = c(names)), class = "table")

mosaic(tabs,pop=FALSE, main="Type of Smokers by Cancer Death")

labeling_cells(text=transproportion , clip_cells=FALSE)(tabs)
library(lattice)
histogram(~ Smoke | Cancer.Incidence + Sex, data = NHANES, main="Type of Smokers by Cancer Incidence and Sex")
histogram(~ Smoke | Cancer.Incidence + Race, data = NHANES, main="Type of Smokers by Cancer Incidence and Race")

histogram(~ Smoke | Cancer.Death + Sex, data = NHANES, main="Type of Smokers by Cancer Death and Sex")
histogram(~ Smoke | Cancer.Death + Race, data = NHANES, main="Type of Smokers by Cancer Death and Race")
library(car)
pairs(~ BMI + Weight + Diet.Iron + Albumin + Serum.Iron + TIBC + Transferin + Hemoglobin)

library(ggplot2)
p = ggplot(NHANES, aes(BMI, Weight)) + geom_point(na.rm = TRUE)
p2 = p + geom_smooth(method = "lm", se = FALSE) + ggtitle ("Correlation between BMI and Weight across Education Level")
p2 + facet_grid(Ed~ .)

p = ggplot(NHANES, aes(BMI, Weight)) + geom_point(na.rm = TRUE)
p2 = p + geom_smooth(method = "lm", se = FALSE) + ggtitle ("Correlation between BMI and Weight across Gender")
p2 + facet_grid(Sex~ .)

p = ggplot(NHANES, aes(BMI, Weight)) + geom_point(na.rm = TRUE)
p2 = p + geom_smooth(method = "lm", se = FALSE) + ggtitle ("Correlation between BMI and Weight across Race")
p2 + facet_grid(Race~ .)

p = ggplot(NHANES, aes(Transferin, TIBC)) + geom_point(na.rm = TRUE)
p + geom_smooth(method = "lm", se = FALSE)

library(MASS)
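# Box-Cox profile over a grid of lambda values; keep the lambda with the highest log-likelihood
bx = boxcox(lm(Transferin ~ TIBC, data = NHANES), lambda = seq(-2, 2, by = 0.1))
lamda = bx$x[which.max(bx$y)]
# Refit on the transformed response; na.exclude pads the residuals with NA so they line up with the rows of NHANES
fit = lm(Transferin^lamda ~ TIBC, data = NHANES, na.action = na.exclude)
res = resid(fit)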
plot(NHANES$TIBC, res)
abline(h = 0, col = "red")
qqnorm(res)
qqline(res, col = "red")

xyplot(Transferin ~ TIBC | Smoke, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transferin vs TIBC by Type of Smokers")

xyplot(Transferin^lamda ~ TIBC | Smoke, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transformed Transferin vs TIBC by Type of Smokers")

xyplot(Transferin ~ TIBC | Cancer.Incidence, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transferin vs TIBC by Cancer Incidence")

xyplot(Transferin^lamda ~ TIBC | Cancer.Incidence, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transformed Transferin vs TIBC by Cancer Incidence")

xyplot(Transferin ~ TIBC | Cancer.Death, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transferin vs TIBC by Cancer Death")

xyplot(Transferin^lamda ~ TIBC | Cancer.Death, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transformed Transferin vs TIBC by Cancer Death")

p = ggplot(NHANES, aes(Serum.Iron, Transferin)) + geom_point(na.rm = TRUE)
p + geom_smooth(method = "lm", se = FALSE) + ggtitle ("Correlation between Serum Iron and Transferin")

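# Box-Cox profile for Serum Iron; keep the lambda with the highest log-likelihood
bx = boxcox(lm(Serum.Iron ~ Transferin, data = NHANES), lambda = seq(-2, 2, by = 0.1))
lamda = bx$x[which.max(bx$y)]
# Refit on the transformed response; na.exclude pads the residuals with NA so they line up with the rows of NHANES
fit = lm(Serum.Iron^lamda ~ Transferin, data = NHANES, na.action = na.exclude)
res = resid(fit)
plot(NHANES$Transferin, res)
abline(h = 0, col = "red")
qqnorm(res)
qqline(res, col = "red")
# R-squared of the transformed model
summary(fit)$r.squared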

xyplot(Serum.Iron ~ Transferin | Smoke, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Serum Iron vs Transferin by Type of Smokers")

xyplot(Serum.Iron^lamda ~ Transferin | Smoke, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transformed Serum Iron vs Transferin by Type of Smokers")

xyplot(Serum.Iron ~ Transferin | Cancer.Incidence, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Serum Iron vs Transferin by Cancer Incidence")

xyplot(Serum.Iron^lamda ~ Transferin | Cancer.Incidence, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transformed Serum Iron vs Transferin by Cancer Incidence")

xyplot(Serum.Iron ~ Transferin | Cancer.Death, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Serum Iron vs Transferin by Cancer Death")

xyplot(Serum.Iron^lamda ~ Transferin | Cancer.Death, type = c("p", "smooth"), col.line = "darkorange", lwd = 3, main = "Transformed Serum Iron vs Transferin by Cancer Death")